Exploring which chemical properties influence the quality of red wines.
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] 5.636023
Since we are looking for the relationship between the independent variables and wine quality rather than at any specific wine, I removed the unneeded X column. I also changed the data type of the quality variable from int to Factor, because it is a categorical rather than a continuous variable; this makes functions such as ggpairs treat it appropriately. Before converting quality to a factor, I calculated its mean to get an idea of the numerical average, which is 5.636023.
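The cleaning steps described above can be sketched in base R as follows. This is a minimal sketch that rebuilds only a few columns of the six rows printed by head() instead of loading the full file, so the mean it computes differs from the full-data value of 5.636023. The data frame name `rw_head` is an assumption for illustration.

```r
# First six rows from the head() output above (subset of columns for brevity).
rw_head <- data.frame(
  X             = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Compute the numeric mean first: mean() is not meaningful once quality is a factor.
mean_quality <- mean(rw_head$quality)

# Drop the row-index column and treat quality as categorical.
rw_head$X <- NULL
rw_head$quality <- factor(rw_head$quality)

str(rw_head)
```

If the ranking of the scores should be retained, `factor(..., ordered = TRUE)` is an alternative worth considering.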
##  fixed.acidity  volatile.acidity  citric.acid    residual.sugar
##  Min.   : 4.60  Min.   :0.1200    Min.   :0.000  Min.   : 0.900
##  1st Qu.: 7.10  1st Qu.:0.3900    1st Qu.:0.090  1st Qu.: 1.900
##  Median : 7.90  Median :0.5200    Median :0.260  Median : 2.200
##  Mean   : 8.32  Mean   :0.5278    Mean   :0.271  Mean   : 2.539
##  3rd Qu.: 9.20  3rd Qu.:0.6400    3rd Qu.:0.420  3rd Qu.: 2.600
##  Max.   :15.90  Max.   :1.5800    Max.   :1.000  Max.   :15.500
##  chlorides        free.sulfur.dioxide  total.sulfur.dioxide
##  Min.   :0.01200  Min.   : 1.00        Min.   :  6.00
##  1st Qu.:0.07000  1st Qu.: 7.00        1st Qu.: 22.00
##  Median :0.07900  Median :14.00        Median : 38.00
##  Mean   :0.08747  Mean   :15.87        Mean   : 46.47
##  3rd Qu.:0.09000  3rd Qu.:21.00        3rd Qu.: 62.00
##  Max.   :0.61100  Max.   :72.00        Max.   :289.00
##  density         pH             sulphates       alcohol        quality
##  Min.   :0.9901  Min.   :2.740  Min.   :0.3300  Min.   : 8.40  3: 10
##  1st Qu.:0.9956  1st Qu.:3.210  1st Qu.:0.5500  1st Qu.: 9.50  4: 53
##  Median :0.9968  Median :3.310  Median :0.6200  Median :10.20  5:681
##  Mean   :0.9967  Mean   :3.311  Mean   :0.6581  Mean   :10.42  6:638
##  3rd Qu.:0.9978  3rd Qu.:3.400  3rd Qu.:0.7300  3rd Qu.:11.10  7:199
##  Max.   :1.0037  Max.   :4.010  Max.   :2.0000  Max.   :14.90  8: 18
The first observation from the summary is that the quality of the wines only ranges from 3 to 8, with the majority rated 5 or 6. The second observation is that some of the other variables, such as residual.sugar and chlorides, have a Max that is large relative to the Median, which indicates that there might be some outliers.
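One way to make the "large Max compared to the Median" observation concrete is Tukey's rule of thumb, which flags values above Q3 + 1.5 * IQR as potential outliers. This rule is a technique swapped in here for illustration, not necessarily the check used in the original analysis. The sketch below applies it to the quartiles printed in the summary; the data frame name `stats` and its column names are assumptions.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides"),
  q1       = c(1.9, 0.07),
  median   = c(2.2, 0.079),
  q3       = c(2.6, 0.09),
  max      = c(15.5, 0.611)
)

# Tukey's upper fence: Q3 + 1.5 * IQR.
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)

# How many times larger the maximum is than the median.
stats$max_over_median <- stats$max / stats$median

# TRUE when the maximum lies beyond the fence.
stats$has_high_outlier <- stats$max > stats$upper_fence

print(stats)
```

For both variables the maximum lies well beyond the fence, which agrees with the suspicion of outliers on the right.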
The visual representation of the distribution of the quality variable agrees with the summary: the quality of the wines only ranges from 3 to 8, with the majority rated 5 or 6.
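The counts per quality level in the summary are enough to quantify "the majority" and to cross-check the mean reported earlier. The following sketch uses only those printed counts; the variable names are assumptions for illustration.

```r
# Counts per quality level, copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

n_total <- sum(quality_counts)

# Share of wines at each level, and the combined share rated 5 or 6.
share <- quality_counts / n_total
share_5_or_6 <- sum(share[c("5", "6")])

# The count-weighted mean of the scores should reproduce the reported average.
scores <- as.numeric(names(quality_counts))
mean_quality <- sum(scores * quality_counts) / n_total

round(share, 3)
share_5_or_6
mean_quality
```

The counts add up to the 1,599 observations, and the weighted mean agrees with the value of 5.636023 computed before the factor conversion.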
The above histogram shows that the data is skewed to the right.
The above histogram shows that the data is skewed to the right.
The above histogram shows that the data is skewed to the right.
The above histogram shows that the chlorides data is roughly normally distributed with some outliers on the right. As is the case with most of the independent variables, the plot shows that removing the top few percent of the chlorides values makes the distribution progressively closer to normal, specifically after removing the top 4%. The same applies to the other variables with outliers.
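Removing the top few percent of a variable can be done with a quantile cutoff. The sketch below is one possible way to do it, not necessarily the code behind the plot: `trim_top` is a hypothetical helper, and the values are simulated right-skewed numbers standing in for the chlorides column, not the real data.

```r
# Keep values at or below the (1 - p) quantile; p = 0.04 drops roughly the top 4%.
# Assumes x has no missing values, as is the case for this dataset.
trim_top <- function(x, p = 0.04) {
  x[x <= quantile(x, probs = 1 - p)]
}

# Simulated right-skewed values (NOT the wine data), for illustration only.
set.seed(42)
chlorides_sim <- rlnorm(1000, meanlog = log(0.08), sdlog = 0.3)

trimmed <- trim_top(chlorides_sim, p = 0.04)

length(trimmed)
max(trimmed) < max(chlorides_sim)
```

Plotting a histogram of `trimmed` next to one of the untrimmed values would then show how much of the right tail the cutoff removes.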
The above histogram shows that the data is skewed to the right.
The above histogram shows that the data is skewed to the right.
The above histogram shows that the pH is roughly normally distributed with some outliers on the right.
The above histogram shows that the data is skewed to the right.
The above plots show the distributions of all the independent variables. Most of them are roughly normally distributed, with some skewed to the right. As previously inferred from the summary, some of the variables also contain outliers.
The data contains 11 independent variables and 1 dependent variable which is the quality. There are 1599 observations in the dataset with no missing values. The independent variables are continuous and have a data type of num, while the dependent variable is discrete and I changed its data type from int to Factor.
The main feature of interest is the dependent variable which is the quality.
The independent variables will drive our model for the quality. In the next sections I will investigate the relationships between them.
No.
Most of them are roughly normally distributed, with some skewed to the right. As previously inferred from the summary, some of the variables also contain outliers.
As for operations on the data, I changed the data type of the quality variable from int to Factor because it is a categorical rather than a continuous variable; this makes functions such as ggpairs treat it appropriately.
To start off the bivariate exploration, I generated the ggpairs plot to visualize all the relationships between the variables. The quality column shows box plots because it is now treated as a categorical variable after the change to a factor. There are some correlations among the independent variables and, more importantly, some trends in the box plots of the quality column that need further exploration.
I will start off by exploring the relationships between the independent variables. The first thing I noticed is the relatively high correlations between the acidity variables (fixed.acidity, volatile.acidity, citric.acid).
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
From the above 3 plots it is clear that there are some relationships between these variables. The correlation between fixed and volatile acidity (-0.26) is not as strong as the correlation between each of them and citric acid. That is only in terms of strength; in terms of direction, fixed acidity is positively correlated with citric acid (0.67) while volatile acidity is negatively correlated with it (-0.55). The correlation with fixed acidity is also stronger than the correlation with volatile acidity. Finally, both of these correlations become weaker at higher levels of citric acid.
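The t values in the test outputs above follow from the correlation estimate and the degrees of freedom through t = r * sqrt(df / (1 - r^2)). As a quick consistency check, the sketch below recomputes them from the printed estimates; the vector and element names are assumptions for illustration.

```r
# Correlation estimates copied from the three cor.test() outputs above.
r <- c(
  fixed_vs_volatile  = -0.2561309,
  fixed_vs_citric    =  0.6717034,
  volatile_vs_citric = -0.5524957
)

# Degrees of freedom: n - 2 for n = 1599 observations.
df <- 1597

# Test statistic for Pearson's correlation under the null hypothesis of zero correlation.
t_stat <- r * sqrt(df / (1 - r^2))
round(t_stat, 3)

# Strength is compared on absolute values; the sign only gives the direction.
sort(abs(r), decreasing = TRUE)
```

With 1,597 degrees of freedom even the weakest of the three correlations is far from zero in t units, so the very small p-values say little about strength; the estimates themselves are the better guide.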
From the box plots above, fixed acidity seems to be a relatively weak factor in the quality. Volatile acidity and citric acid show clear negative and positive patterns, respectively, with the quality of the wines, and both patterns become weaker as the quality gets better.
## # A tibble: 6 x 6
##   quality mean_alcohol median_alcohol min_alcohol max_alcohol     n
##   <fct>          <dbl>          <dbl>       <dbl>       <dbl> <int>
## 1 3               9.96           9.93         8.4        11      10
## 2 4              10.3           10            9          13.1    53
## 3 5               9.90           9.7          8.5        14.9   681
## 4 6              10.6           10.5          8.4        14     638
## 5 7              11.5           11.5          9.2        14     199
## 6 8              12.1           12.2          9.8        14      18
Based on the summaries and plots above, there seems to be a pattern between alcohol and quality. From the box plots it is clear that the higher the median alcohol, the better the quality. The pattern holds except for the quality value of 5, which also appears to have the most outliers. From the second line plot it is clear that these outliers pulled the mean a bit higher than the median, but the mean still doesn’t follow the pattern observed for the other quality values. Another observation is that the maximum alcohol level for the top 3 quality values is 14%, which might be an industry standard for higher quality wines.
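A per-quality summary like the table above can be produced in several ways. The tibble output suggests dplyr was used originally; the sketch below uses base R `aggregate()` instead so that it runs without extra packages. It works on the six rows shown by head() only, so its numbers do not match the full-data table, and the names `rw_head`, `alcohol_stats` and `by_quality` are assumptions for illustration.

```r
# Alcohol and quality for the first six rows shown by head() earlier.
rw_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = factor(c(5, 5, 5, 6, 5, 5))
)

# Summary statistics computed within each quality level.
alcohol_stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), n = length(x))
}

by_quality <- aggregate(alcohol ~ quality, data = rw_head, FUN = alcohol_stats)

# aggregate() returns the statistics as a matrix column; flatten it into
# ordinary columns named alcohol.mean, alcohol.median, and so on.
by_quality <- do.call(data.frame, by_quality)

by_quality
```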
Other general trends between quality and the independent variables can be identified from the box plots below, usual properties of better quality wines are listed below each plot.
Residual sugar: no clear relation
Chlorides: the less the better
Free sulfur dioxide: non-linear relation
Bound sulfur dioxide: non-linear relation
Total sulfur dioxide: non-linear relation
Density: the less the better
pH: the less the better
Sulphates: the more the better
After exploring the relationships between all the features and the quality, as well as some of the relationships among the independent variables, here is a summary of what I observed.
There are some linear and non-linear trends between the quality and the other features.
An interesting observation came from analyzing the acidity variables: fixed and volatile acidity are both correlated with citric acid. After some online research I found that citric acid is a type of fixed acid. However, on further reading of the data documentation, I found that even though citric acid is considered a fixed acid, in this dataset the fixed acidity variable represents only tartaric acid. These observations show how important it is to have domain knowledge about the topic you are analyzing, and they are a reminder that correlation does not always imply causation. Further analysis in the multivariate section might uncover more details.
Below is a summary of trends between quality and the independent variables:
* Fixed acidity: no clear relation
* Volatile acidity: the less the better
* Citric acid: the more the better
* Alcohol: the more the better
* Residual sugar: no clear relation
* Chlorides: the less the better
* Free sulfur dioxide: non-linear relation
* Bound sulfur dioxide: non-linear relation
* Total sulfur dioxide: non-linear relation
* Density: the less the better
* pH: the less the better
* Sulphates: the more the better
Exploring the relationship between the free and bound sulfur dioxide at different quality levels confirms that the relationship between quality and sulfur dioxide is non-linear: the pattern for the highest quality is quite similar to that for the lowest quality.
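Bound sulfur dioxide is not a column in the raw data. Since the attribute description defines total sulfur dioxide as the free plus the bound forms, one plausible way to derive it is by subtraction. The sketch below assumes that definition; the column name `bound.sulfur.dioxide` is hypothetical, and only the six rows shown by head() are used.

```r
# Free and total SO2 for the first six rows shown by head() earlier.
rw_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Assumed definition: bound SO2 is the part of the total that is not free.
rw_head$bound.sulfur.dioxide <-
  rw_head$total.sulfur.dioxide - rw_head$free.sulfur.dioxide

rw_head
```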
From the above plot it is clear that there is a strong negative correlation between the total acidity and the pH, as we would expect, but there is no clear pattern of correlation with the quality of the wines.
Further analysis shows that there are correlations between citric acid, fixed acidity and pH for all quality levels. This information is useful for the modeling part: including more than one of them as predictors could cause multicollinearity issues.
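A simple way to screen predictors for this kind of overlap is to compute their correlation matrix and list the pairs above a chosen cutoff. The sketch below illustrates the idea on the six rows shown by head() only, so the coefficients are not the full-data values; the cutoff of 0.5 and all object names are assumptions for illustration.

```r
# Three of the predictors for the first six rows shown by head() earlier.
rw_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pairwise Pearson correlations between the predictors.
cor_mat <- cor(rw_head)

# Pairs whose absolute correlation exceeds an illustrative cutoff
# (upper triangle only, so each pair is listed once).
cutoff <- 0.5
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = round(cor_mat[high], 2)
)
```

Any pair listed is a candidate for keeping only one of the two variables in a model.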
##
## Call:
## glm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
## density, family = binomial, data = rw)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.5947   0.0212   0.0415   0.0807   1.2132
##
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           487.03180  187.91669   2.592  0.00955 **
## volatile.acidity       -7.34560    1.52277  -4.824 1.41e-06 ***
## chlorides              -9.35988    3.70543  -2.526  0.01154 *
## total.sulfur.dioxide    0.04207    0.02000   2.103  0.03543 *
## density              -478.80713  187.89867  -2.548  0.01083 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 121.428 on 1598 degrees of freedom
## Residual deviance: 80.397 on 1594 degrees of freedom
## AIC: 90.397
##
## Number of Fisher Scoring iterations: 9
The logistic model above shows that volatile.acidity, chlorides, total.sulfur.dioxide and density have significant p values.
From the plots it is clear that some of the independent variables are correlated to each other which is something I had to consider when building the model, but none of the interactions showed a significantly higher effect on the quality of the wines.
I chose a logistic regression model over a linear regression for two reasons. First, the quality variable is not continuous. Second, the data is not equally distributed between the different quality levels. On the other hand, this logistic regression model has its drawbacks: it discards the fact that quality is an ordinal variable.
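One detail worth checking here: according to R's documentation for the binomial family, a factor response is interpreted as "failure" for its first level and "success" for every other level. If the model above was fitted with quality as a six-level factor, it would therefore be modeling quality 3 versus all higher scores rather than the full scale. The reported null deviance of 121.428 is consistent with that reading, since it matches a binary outcome with 10 observations in one class and 1,589 in the other. The sketch below illustrates both points; the regression part uses simulated data (not the wine data), and all object names are assumptions for illustration.

```r
# Null deviance of an intercept-only binomial model with 10 "failures"
# (the wines rated 3) and 1589 "successes" (all other wines).
n_fail <- 10
n_succ <- 1599 - n_fail
null_dev <- -2 * (n_fail * log(n_fail / 1599) + n_succ * log(n_succ / 1599))
null_dev

# Simulated illustration (NOT the wine data) of how glm() treats a
# multi-level factor response under family = binomial.
set.seed(1)
n <- 300
x <- rnorm(n)
y <- factor(sample(c(3, 5, 6), n, replace = TRUE))

fit_factor <- glm(y ~ x, family = binomial)

# Explicit recoding: 1 for anything other than the first level, 0 otherwise.
y_not_first <- as.integer(y != levels(y)[1])
fit_binary <- glm(y_not_first ~ x, family = binomial)

# The two fits give the same coefficients.
all.equal(coef(fit_factor), coef(fit_binary))
```

If the goal is to model the whole scale, an ordinal model such as `MASS::polr`, or a deliberately chosen binary split of the quality scores, might be alternatives to consider.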
The logistic model shows that volatile.acidity, chlorides, total.sulfur.dioxide and density variables have significant p values.
The distribution of the quality variable appears roughly normal: the quality of the wines only ranges from 3 to 8, with the majority rated 5 or 6. This isn’t necessarily a good thing. In our case the different wine qualities are unequally represented in the dataset, which will negatively affect the accuracy of our model. The model also will not be able to predict the part of the scale that is not represented in the dataset.
From the box plots above, fixed acidity seems to be a relatively weak factor in the quality. Volatile acidity and citric acid show clear negative and positive patterns, respectively, with the quality of the wines, and both patterns become weaker as the quality gets better.
An interesting observation concerns the correlations of fixed and volatile acidity with citric acid. After some online research I found that citric acid is a type of fixed acid. However, on further reading of the data documentation, I found that even though citric acid is considered a fixed acid, in this dataset the fixed acidity variable represents only tartaric acid. These observations show how important it is to have domain knowledge about the topic you are analyzing, and they are a reminder that correlation does not always imply causation.
This plot shows that there are correlations between citric acid, fixed acidity and pH for all quality levels. These correlations between the independent variables can cause issues with the quality of the models in case we include more than one of them. This information was considered when creating the logistic model.
Reflecting on the exploration, I think this dataset needs to be of better quality to support a more accurate model. The different wine qualities should be equally represented, including the missing ones, or at least the scale should be modified. Another issue might be that differences between the experts grading the wines cause some of the observed effects.
The struggle I went through in the beginning was not having domain knowledge about the chemical properties of wine. After some research I was able to get a high-level overview of the different properties and their effects on quality, which helped drastically in understanding many of the relationships between the variables.
An example of future work with this data would be changing the quality variable from the median to the mean of the experts’ ratings, which would give a continuous variable that can be better represented with a linear model. Another would be obtaining a better representation of the different wine qualities and updating the scale.